- Feed-forward neural networks
- Recurrent neural networks
- SRN
- LSTM
- Bi-LSTM
- GRU
A machine learning subfield concerned with learning representations of data. Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]
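The two equations above describe a two-layer feed-forward network: a hidden layer followed by an output layer, each an affine transform passed through a nonlinearity. A minimal sketch in numpy (the dimensions 3/4/2 and the random weights are illustrative assumptions, not from the source):

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = rng.normal(size=3)
h = sigmoid(W1 @ x + b1)   # hidden layer: h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)   # output layer: y = sigma(W2 h + b2)
```

Each layer re-represents the output of the layer below it, which is the "hierarchy of multiple layers" mentioned above.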
Optimize
objective/cost function \(J(\theta)\)
Generate
error signal that measures difference between predictions and target values
Use error signal to change the
weights and get more accurate predictions
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function
objective/cost function \(J(\theta)\)
Update each element of \(\theta\):
\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]
Matrix notation for all parameters ( \(\alpha\): learning rate):
\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)\]
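The update rule can be seen in action on a toy cost function. This sketch (the quadratic cost and the settings are illustrative assumptions) shows how repeatedly subtracting \(\alpha \nabla_{\theta} J(\theta)\) drives \(\theta\) toward the minimum:

```python
import numpy as np

# Toy cost J(theta) = ||theta||^2, whose gradient is 2 * theta.
# The minimum is at theta = 0; each update moves a fraction of
# the way there, scaled by the learning rate alpha.
theta = np.array([3.0, -2.0])
alpha = 0.1  # learning rate

for _ in range(100):
    grad = 2.0 * theta           # nabla_theta J(theta)
    theta = theta - alpha * grad # gradient-descent update
```

After 100 steps \(\theta\) has shrunk by a factor of \(0.8^{100}\), i.e. it is numerically at the minimum.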
Recursively apply the chain rule through each node
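For a single chain of nodes this recursion is just multiplying local derivatives. A minimal worked example (the one-weight model and squared-error loss are illustrative assumptions), checked against a numerical gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Computation graph: z = w * x,  y = sigma(z),  J = 0.5 * (y - t)^2
w, x, t = 0.5, 2.0, 1.0
z = w * x
y = sigmoid(z)

# Backward pass: chain rule node by node
dJ_dy = y - t                  # local derivative of the loss
dy_dz = y * (1.0 - y)          # local derivative of the sigmoid
dz_dw = x                      # local derivative of the product
dJ_dw = dJ_dy * dy_dz * dz_dw  # dJ/dw by the chain rule

# Sanity check with a central-difference numerical gradient
eps = 1e-6
num = (0.5 * (sigmoid((w + eps) * x) - t) ** 2
       - 0.5 * (sigmoid((w - eps) * x) - t) ** 2) / (2 * eps)
```

Backpropagation applies exactly this multiplication of local derivatives, recursively, over the whole network graph.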
The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).
How to avoid overfitting?
Suppose we had the following scenario:
Day 1: Lift Weights
Day 2: Swimming
Day 3: At this point, our model must decide whether we should take a rest day or do yoga. Unfortunately, it only has access to the previous day: it knows we swam yesterday, but it doesn't know whether we had taken a break the day before. Therefore, it can end up predicting yoga.
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]
\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]
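The five LSTM equations above translate directly into one step of computation. A minimal numpy sketch (the dimensions, zero biases, and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    v = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ v + b_f)          # forget gate
    i_t = sigmoid(W_i @ v + b_i)          # input gate
    C_tilde = np.tanh(W_C @ v + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ v + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # new hidden state
    return h_t, C_t

# Hypothetical sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(1)
n_in, n_h = 3, 4

def rand_W():
    return rng.normal(size=(n_h, n_h + n_in))

W_f, W_i, W_C, W_o = rand_W(), rand_W(), rand_W(), rand_W()
b = np.zeros(n_h)
h, C = np.zeros(n_h), np.zeros(n_h)
h, C = lstm_step(rng.normal(size=n_in), h, C, W_f, b, W_i, b, W_C, b, W_o, b)
```

The gates \(f_t\), \(i_t\), \(o_t\) all lie in \((0, 1)\), so the cell state \(C_t\) can carry information across many steps, which is exactly what the SRN in the workout example above cannot do.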
POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
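The GRU equations merge the LSTM's forget and input gates into a single update gate \(z_t\) and drop the separate cell state. A minimal numpy sketch of one step (dimensions and random weights are illustrative assumptions; biases are omitted, matching the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    v = np.concatenate([h_prev, x_t])                # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ v)                           # update gate
    r_t = sigmoid(W_r @ v)                           # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # interpolate old/new

# Hypothetical sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(2)
n_in, n_h = 3, 4
W_z = rng.normal(size=(n_h, n_h + n_in))
W_r = rng.normal(size=(n_h, n_h + n_in))
W = rng.normal(size=(n_h, n_h + n_in))
h = gru_step(rng.normal(size=n_in), np.zeros(n_h), W_z, W_r, W)
```

Note the final line: \(h_t\) is a per-dimension interpolation between the old state and the candidate, controlled by \(z_t\).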